Genetic Sex Validation for Sample Tracking in Clinical Testing

Objective
Data from DNA genotyping via a 96-SNP panel in a study of 25,015 clinical samples were utilized for quality control and tracking of sample identity in a clinical sequencing network. The study aimed to demonstrate the value of both the precise SNP tracking and the utility of the panel for predicting the sex-by-genotype of the participants, to identify possible sample mix-ups.
Results
Precise SNP tracking showed no sample swap errors within the clinical testing laboratories. In contrast, when comparing predicted sex-by-genotype to the provided sex on the test requisition, we identified 110 inconsistencies among 25,015 clinical samples (0.44%) that had occurred during sample collection or accessioning. The genetic sex predictions were confirmed using additional SNP sites in the sequencing data or high-density genotyping arrays. The discrepancies were found to result from clerical errors, samples from transgender participants, and samples from stem cell or bone marrow transplant patients, along with undetermined sample mix-ups.


Introduction
The implementation of next generation sequencing (NGS) technologies in clinical laboratories [1][2][3] typically involves three phases: (i) the pre-analytic phase, including sample collection, DNA extraction and shipment; (ii) the analytic phase of NGS library preparation, DNA sequencing and bioinformatics analysis; and (iii) a post-analytic phase including clinical report generation and delivery. Each phase is inherently subject to sample tracking and identification errors, with prior reports of more than 46% of errors occurring during the pre-analytical phase, caused by inappropriate test requests, order entry errors, patient misidentification, and labelling errors [4]. Validation and tracking of sample identity is therefore a basic and important aspect of effective clinical NGS testing.
DNA-based methods for sample tracking include genotyping of short tandem repeats (STRs) or single nucleotide polymorphisms (SNPs) [5]. STRs are generally located in non-coding regions, are prone to high sequencing error rates, and often require longer than typical sequencing read lengths to precisely define the number of repeats, which limits their application. In contrast, SNPs are ubiquitous in the genome and simple to assay [6,7]. In this study, a 96-SNP

Fluidigm SNP genotyping assay
Two clinical laboratories harmonized methods for the program [8]. Both utilized a 96-SNP panel to track samples and determine ancestry, but each incorporated a different selection of SNPs. Each 96-SNP panel contained a subset of SNPs on the sex chromosomes. The remaining autosomal SNPs lie within the target region of the capture design used in the eMERGE program (Supplementary material) [8]. Assays were performed according to the manufacturer's recommendations.
The BCM-HGSC-CL's 96-SNP panel replaced 19 of the original Fluidigm SNPtrace 96 sites to match genomic regions specifically targeted in eMERGE III. The remaining sites included 3 SNPs on Chromosome X and 3 on Chromosome Y [9,10]. At the Broad Institute, the chosen SNPs included 95 autosomal SNPs and 1 sex-determining assay (AMG_3B), covering the AMELX and AMELY genes, which differ by a sex-specific 6 base-pair insertion/deletion.
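The way a handful of sex-chromosome SNPs can yield a sex-by-genotype call may be sketched as follows. This is a minimal illustration, not the laboratories' actual calling software: the function name `predict_sex`, the genotype encoding, and the decision rules (all Y SNPs called and no heterozygous X SNP implies male; no Y signal with X calls implies female; anything else is ambiguous) are assumptions made for the example.

```python
from typing import Optional, Sequence


def predict_sex(x_calls: Sequence[Optional[str]],
                y_calls: Sequence[Optional[str]]) -> str:
    """Return 'male', 'female', or 'ambiguous' from sex-chromosome SNP calls.

    x_calls: diploid-style genotype strings such as 'AA' or 'AG'; None = no call.
    y_calls: allele strings such as 'C'; None = no signal (hypothetical encoding).
    """
    x_called = [g for g in x_calls if g is not None]
    y_called = [g for g in y_calls if g is not None]
    # A heterozygous X genotype implies two distinct X chromosomes.
    x_het = any(len(set(g)) > 1 for g in x_called)

    # Every Y SNP gives signal and no X SNP is heterozygous: consistent with XY.
    if y_calls and len(y_called) == len(y_calls) and not x_het:
        return "male"
    # No Y signal at all, but X SNPs are called: consistent with XX.
    if not y_called and x_called:
        return "female"
    # Mixed evidence (e.g. Y signal plus heterozygous X) needs manual review;
    # it can reflect contamination or a sex-chromosome anomaly.
    return "ambiguous"
```

For example, `predict_sex(["AG", "GG", "TT"], ["C", "T", "G"])` falls into the ambiguous branch, the pattern that would prompt a manual review of the scatter plots.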

Illumina Infinium SNP array assays and NGS
HumanCoreExome v1-3 BeadChips, which contain 500K variant sites including more than 12,900 on the X chromosome that are informative for genetic sex prediction, were utilized according to the manufacturer's specifications. DNA sequencing for the eMERGE phase III program has been described previously [8].

Results
The BCM-HGSC-CL and LMM/Broad laboratories utilized the same analytical platform foundation, employing slightly different SNP sites for the assays but generally similar workflows (Fig. 1), to test for concordance between data generated from the 96-SNP panel genotyping and the DNA sequence data. The average SNP call rates were 97.3% and 97.5% for the 25,015 samples processed at the BCM-HGSC-CL and the LMM/Broad, respectively. When comparing the 96-SNP panel genotype-based sex to the reported sex at the time of sample accessioning, a total of 110 (0.44%) non-concordant cases were identified across the two testing laboratories. The two testing laboratories utilized slightly different workflows to technically validate the sex discrepancies.
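The comparison of genotype-based sex with reported sex, and the resulting discordance rate, can be sketched as below. The `Sample` record and its field names are hypothetical; only the counts (110 of 25,015) come from the text.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Sample:
    sample_id: str       # hypothetical identifier
    reported_sex: str    # sex provided on the test requisition
    genetic_sex: str     # sex predicted from the SNP panel


def find_discordant(samples: Iterable[Sample]) -> List[Sample]:
    """Keep samples where both values are definite and they disagree."""
    definite = {"male", "female"}
    return [
        s for s in samples
        if s.reported_sex in definite
        and s.genetic_sex in definite
        and s.reported_sex != s.genetic_sex
    ]


def discordance_rate_percent(n_discordant: int, n_total: int) -> float:
    """Discordant samples as a percentage of all samples processed."""
    return 100.0 * n_discordant / n_total


# Counts reported in the study: 110 discordant among 25,015 samples.
print(round(discordance_rate_percent(110, 25015), 2))  # 0.44
```

Samples with a missing reported sex or an uncalled genetic sex are excluded here rather than counted as discordant, which is an assumption of this sketch.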
At the BCM-HGSC-CL, of the 14,515 samples processed, 73 samples with sex discrepancies were re-tested with the same 96-SNP panel. Identical results were obtained for 70 of the re-tested samples (Table 1). For the remaining 3 cases, where the sex provided on the test requisition was male, non-concordant or ambiguous data were observed between the initial and the repeated assays. For two of these samples, the automated software calls from one of the duplicate assays for each sample indicated that the DNA source was an individual with Klinefelter Syndrome (47, XXY). However, further review of the SNP scatter plots for autosomal and sex SNPs indicated that the inconsistent sex calls most likely resulted from sample contamination involving a mixture of male and female DNA (Fig. 2). The third sample was initially called as female with lower confidence. In the repeated assay, one of the X SNPs failed to call because it fell between clusters in the plot analysis. This is most likely because the female sample was mixed with DNA from another female. HumanCore Exome Array results (Table 1) confirmed the 96-SNP panel genotyping sex data, including for the two female samples suspected of contamination with additional male or other female DNA.
At the Broad/LMM, the reported sex from the test requisition was compared with the genetic sex determined by both the Fluidigm genotyping assay and the data from the eMERGE III sequencing panel. Of the 10,500 samples processed, 151 were initially either identified as discordant or had no sex determination. For 95 samples, the Fluidigm assay could not return a sex determination; however, the sequencing-derived sex matched the reported sex for each, and no further action was taken. For 19 of the remaining 56 samples, the sequencing and reported sex were concordant but did not match the genotyping-determined sex. Further review of these 19 samples showed that the genotyping assay calls were generally borderline or low confidence, suggesting that sub-optimal performance of the single sex-determining SNP was the reason for the data discrepancy, rather than either a sex reporting error at accession or a sample mix-up in the testing laboratory. The remaining 37 samples had highly confident sex determination calls from both the SNP assay and the subsequent DNA sequencing that were concordant with each other but did not match the site-reported sex (Table 1).
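The triage logic described in this paragraph, combining the reported sex with the genotyping and sequencing calls, can be sketched as a small decision function. The category labels and the function name `triage` are hypothetical and chosen for illustration; the rules follow the three groups described above (95 no-calls, 19 suspect genotyping calls, 37 confirmed discrepancies).

```python
from typing import Optional


def triage(reported: str, genotyping: Optional[str], sequencing: str) -> str:
    """Classify a sample from three sex values; genotyping=None means no call."""
    if genotyping is None:
        # Genotyping failed: fall back on sequencing versus the requisition.
        return "no_action" if sequencing == reported else "needs_review"
    if genotyping == reported and sequencing == reported:
        return "concordant"
    if sequencing == reported:
        # Only the genotyping assay disagrees: suspect a borderline SNP call.
        return "suspect_genotyping_call"
    if genotyping == sequencing:
        # Both assays agree with each other but not with the requisition.
        return "confirmed_discrepancy"
    return "needs_review"


# Counts reported in the text: 151 flagged, 95 no-calls, 19 suspect calls.
flagged, no_call, suspect = 151, 95, 19
print(flagged - no_call - suspect)  # 37 confirmed discrepancies
```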
Internal tracking showed that none of the 110 confidently identified sex-discrepant samples arose from errors within the clinical DNA sequencing laboratories and that most errors were likely introduced prior to shipment of samples. The sampling sites identified errors in test requisitions, sample extraction, and sample handling procedures for 54 cases. Forty-six of these had information that was incorrectly or incompletely entered on the test requisitions and were resolved by examination of other records. In 6 other cases, it was determined that incorrect samples had been shipped from the sampling sites to the genome centers. Biological explanations for the discrepant tracking data were identified for an additional 12 cases. In 4 of these 12 cases, further examination of records revealed that the samples were provided by transgender participants. In addition, 8 sex-discrepant samples were determined to be from individuals who had received stem cell or bone marrow transplants. Causes of the discrepancies between genetic and reported sex are listed in Table 2. Where possible, the information on test requisition forms was amended and correct clinical reports were issued for 45 cases processed at the BCM-HGSC-CL, or the incorrect samples were replaced and re-processed. Twelve cases sequenced at the BCM-HGSC-CL with sample mix-ups due to unknown causes were withdrawn from the study. Similarly, 32 unsolved cases sequenced at LMM/Broad were either withdrawn or remain under investigation.
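The counts reported in the Results can be cross-checked with a short tally. All numbers below are taken from the text; the dictionary keys are labels chosen for this sketch, and the per-laboratory split of 73 and 37 discordant samples is read from the two preceding paragraphs.

```python
# Discordant samples per laboratory, as reported in the Results.
discordant = {"BCM-HGSC-CL": 73, "LMM/Broad": 37}

# Resolved causes, as reported in the Results.
resolved = {
    "requisition, extraction or handling errors": 54,
    "transgender participants": 4,
    "stem cell or bone marrow transplants": 8,
}

# Unresolved cases per laboratory, as reported in the Results.
unresolved = {"BCM-HGSC-CL": 12, "LMM/Broad": 32}

total_discordant = sum(discordant.values())
total_resolved = sum(resolved.values())
total_unresolved = sum(unresolved.values())

# Resolved plus unresolved cases should account for every discordant sample.
assert total_resolved + total_unresolved == total_discordant

for cause, n in resolved.items():
    share = 100.0 * n / total_resolved
    print(f"{cause}: {n} ({share:.1f}% of resolved cases)")
print(f"resolved: {total_resolved} of {total_discordant}")
```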

Discussion
To identify sample swaps during the processing of 25,015 clinical samples in the NIH eMERGE III program, two clinical DNA sequencing laboratories first utilized a Fluidigm-based 96-SNP panel assay to track internal processes. These analyses indicated that no sample swaps had occurred in the interval between sample arrival at the testing laboratories and the delivery of the final DNA sequencing data. In contrast, when the test was expanded to compare the self-reported sex of participants at the time of their initial enrollment with the predicted sex-by-genotype, there were 110 discordant samples. A battery of follow-up tests indicated that these likely arose before the materials were received at the clinical DNA sequencing laboratories. The bases of the sample tracking errors at sample collection sites were determined in 66 of the 110 cases (60%), leaving the remaining 44 cases unsolved and under investigation. Of these 66 resolved cases, the largest source of discordance, accounting for 54 cases (82%), was clerical or shipping errors. The remaining 12 cases (18% of the 66 solved) had biological underpinnings that explained the discordant results: 8 were due to stem cell or bone marrow transplants, while 4 were from transgender individuals. Future sample collection procedures should be modified to ensure that participants are invited to note these types of events at the time of collection, so that this information is available for quality control.
The 96-SNP panel has proven value for precise sample tracking [11]. In general, 20 informative SNP loci are sufficient for unique individual sample identification [12,13]. Other SNP panels have been used for identification of human samples [6,14,15]. A low-density QC genotyping array launched by Illumina, which includes 15,949 markers, has been utilized in genomic-based clinical diagnostics [16]. Our studies showed that these two different SNP platforms exhibited consistent results when applied for sex identification. In comparison to the Illumina Infinium array platform, the workflow for the 96-SNP panel assay is faster (1-day workflow vs 3-day workflow) and more cost-effective. However, the Illumina Infinium array platform provides more information on linkage analysis, HLA haplotyping, ethnicity determination and other genetic information in addition to fingerprinting, and thus may be preferred in some scenarios. Other commercial systems are also available to substitute for the platforms described here if they provide cost-effective and precise data of similar quality.
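The statement that about 20 informative SNP loci suffice for unique identification can be illustrated with a back-of-the-envelope random match probability. This is a simplified sketch and not the calculation used in the cited references: it assumes biallelic loci in Hardy-Weinberg equilibrium, independence between loci, and unrelated individuals, and the function names are hypothetical.

```python
from typing import Iterable


def locus_match_probability(maf: float) -> float:
    """Chance that two unrelated people share a genotype at one biallelic SNP.

    Under Hardy-Weinberg equilibrium the genotype frequencies are
    p^2, 2pq and q^2; the match probability is the sum of their squares.
    """
    p, q = 1.0 - maf, maf
    return p ** 4 + (2.0 * p * q) ** 2 + q ** 4


def panel_match_probability(mafs: Iterable[float]) -> float:
    """Product of per-locus match probabilities, assuming independent loci."""
    prob = 1.0
    for maf in mafs:
        prob *= locus_match_probability(maf)
    return prob


# Twenty maximally informative loci (minor allele frequency 0.5).
print(locus_match_probability(0.5))        # 0.375
print(panel_match_probability([0.5] * 20))
```

With 20 such loci the match probability falls to the order of one in a few hundred million, far below what is needed to distinguish the 25,015 samples in this study; loci with lower minor allele frequencies are less informative and would raise this value.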
This level of tracking error is unacceptable for ongoing clinical practice, but the study does not represent the levels that will be expected in further clinical programs. At least one laboratory declared their initial sample enrollments as 'research samples' and thus committed to later repeat assays under a fully compliant protocol, to verify any findings that may impact care. Others were able to quickly identify points of error and rectify their protocols to ensure faithful future sample handling. All sites committed to rechecking of records and reconciling actionable findings with orthogonal data, including family histories and biochemical tests, before returning results. The 'lessons learned' from these analyses ensure that a repeat of the same program would likely minimize any similar errors.

Limitations
While false positive rates are low for this application of SNPtrace, false negative rates will be high. Here, the overall level of genetic and reported sex discordance of 0.44% is likely an underestimate of the true error rate in this study, because a random sample swap would be expected to produce a detectable sex mismatch only about 50% of the time. The true ratio may be skewed by factors introducing a sex bias in the direction of misclassification. This could be caused by skewed phenotypes of individuals with sex chromosome anomalies, or by gender obfuscation that is socially driven in an unequal manner, depending on the gender identity of the individual. Overall, the rate is likely higher than the 0.44% identified here, but is not anticipated to exceed twice that level.

The 96-SNP panel was used to track samples through the clinical NGS workflow in the National Institutes of Health's Electronic Medical Records and Genomics Phase III (eMERGE) program [8]. The network linked together 11 sample collection sites and 2 clinical genetic testing laboratories: the Human Genome Sequencing Center Clinical Laboratory at Baylor College of Medicine (BCM-HGSC-CL) and the Mass General Brigham Laboratory for Molecular Medicine (LMM) in partnership with the Clinical Research Sequencing Platform (CRSP) at the Broad Institute of MIT and Harvard. A total of 25,015 clinical DNA samples were processed. The 96-SNP panel-based procedure provided a robust method for sample tracking in the clinical NGS workflow and showed that the testing of sex can provide a valuable quality control tool.
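The underestimation argument in the Limitations paragraph can be made concrete with a small calculation. It is a sketch under stated assumptions: every error is treated as a random swap between two samples, and a swap is detectable only when the two samples differ in sex. The function names and the `female_fraction` parameter are hypothetical.

```python
def swap_detection_probability(female_fraction: float) -> float:
    """Chance that two randomly swapped samples come from different sexes."""
    if not 0.0 < female_fraction < 1.0:
        raise ValueError("female_fraction must lie strictly between 0 and 1")
    return 2.0 * female_fraction * (1.0 - female_fraction)


def upper_bound_error_rate(observed_percent: float,
                           female_fraction: float = 0.5) -> float:
    """Scale the observed discordance by the chance a swap is detectable."""
    return observed_percent / swap_detection_probability(female_fraction)


# With a balanced cohort, half of all random swaps go undetected by a sex check.
print(swap_detection_probability(0.5))   # 0.5
print(upper_bound_error_rate(0.44))      # 0.88
```

Because errors such as a mis-entered sex on the requisition are always detected by a sex check, scaling all 110 cases this way gives an upper bound rather than a point estimate, which is consistent with the statement that the true rate should not exceed twice 0.44%.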
Abbreviations
HGSC: Human Genome Sequencing Center
LMM: Laboratory for Molecular Medicine
NGS: Next generation DNA sequencing
STR: short tandem repeat
SNP: single nucleotide polymorphism
PCR: polymerase chain reaction
eMERGE: Electronic Medical Records and Genomics
EMR: electronic medical record
HGSC-CL: Human Genome Sequencing Center Clinical Laboratory
BCM: Baylor College of Medicine
CRSP: Clinical Research Sequencing Platform
STA: specific target amplification
LSP: locus specific primer
IFC: Integrated Fluidic Circuit
ASP: Allele Specific Primers
ACMG: American College of Medical Genetics
NHGRI: National Human Genome Research Institute
IRB: institutional review board

Table 1
Comparison of genetic sex determined in various assays and reported sex on test requisition.
*: Insufficient gDNA for Illumina array; **: Sex not reported on requisition form; ***: Sex not called in assay; NA: not available.

Next, Illumina HumanCore Exome Arrays were utilized as an orthogonal high-density hybridization genotyping assay to further test 71 of the 73 samples with sex inconsistencies; the remaining two samples had insufficient genomic DNA (Table 1).